Author: Rahul Ramesh Tejannavar (rtejanna)¶
Course: Exploratory Data Analysis and Visualization with Python
Introduction¶
The "American Dream" is built on the promise of social mobility—the idea that if you work hard and get an education, you can succeed regardless of your background. Education is touted as the "Great Equalizer.”
However, economic data suggests a different reality. The United States has one of the highest levels of income inequality among developed nations (Gini Coefficient > 40).
This gives rise to questions :
- Does this high level of income inequality create a barrier to education? And if so, where is that barrier located?
- Does it stop poor children from entering school, or does it stop them from graduating college?
This project explores whether wealth inequality acts as a "floor" (preventing access) or a "ceiling" (limiting upward mobility) in the US education system.
Objective¶
To statistically quantify the relationship between income inequality (measured by the Gini Index) and educational inequality (measured by the completion gap between the richest and poorest quintiles).
The project primary aims to:
- To visualize the trend of income inequality in the USA over the last decade.
- To calculate the "Opportunity Gap"—the difference in completion rates between the Top 20% and Bottom 20% of households.
- To determine if this gap is uniform across all levels of education (Primary, Secondary, Tertiary).
- To analyze if years with higher income inequality correlate with wider educational gaps.
Methodology¶
- Data Sources:
- Income Data: World Bank Open Data (Indicator:
SI.POV.GINI). - Education Data: UNESCO World Inequality Database on Education (WIDE).
- Income Data: World Bank Open Data (Indicator:
- Scope: United States of America (2000–2020 timeframe, based on data availability).
- Metrics Used:
- Independent Variable: Gini Index (0–100 scale).
- Dependent Variable: Completion Rates for Primary, Upper Secondary, and Tertiary education, disaggregated by "Wealth Quintile 1 (Poorest)" and "Wealth Quintile 5 (Richest)."
- Tools: Python (Pandas for data cleaning/merging, Plotly for interactive visualization).
- Analytical Approach:
- Data Cleaning: Isolated US data, filtered for specific wealth quintiles, and merged datasets on the
Yearkey. - Gap Analysis: Calculated the "Education Gap" (
Rich_Rate - Poor_Rate) for each year and level. - Visualization: Used interactive Line Charts, Dumbbell Plots, and Scatter Plots to reveal trends.
- Data Cleaning: Isolated US data, filtered for specific wealth quintiles, and merged datasets on the
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from scipy import stats
import warnings
import requests
import io
warnings.filterwarnings("ignore")
# A. Cleaning GINI Data Using Robust Header Search
gini_file = "API_SI.POV.GINI_DS2_en_excel_v2_130015.xls - Data.csv" #Download this file (The Naming Convention is the same) from the Raw Data Folder in Git Hub Repository.
try:
# Find header row dynamically
temp_df = pd.read_csv(gini_file, header=None)
header_idx = temp_df[temp_df.apply(lambda x: x.astype(str).str.contains("Country Name").any(), axis=1)].index[0]
gini_raw = pd.read_csv(gini_file, header=header_idx)
# Filter USA & Melt
gini_usa = gini_raw[gini_raw['Country Name'] == 'United States'].copy()
year_cols = [c for c in gini_usa.columns if str(c).isdigit()]
gini_clean = gini_usa.melt(id_vars=['Country Name'], value_vars=year_cols, var_name='Year', value_name='Gini_Index')
gini_clean['Year'] = pd.to_numeric(gini_clean['Year'])
gini_clean.dropna(subset=['Gini_Index'], inplace=True)
print(f" Gini Data Loaded: {len(gini_clean)} years")
except Exception as e:
print(f" Gini Load Error: {e}")
print(" Ensure 'API_SI.POV.GINI_DS2_en_excel_v2_130015.xls - Data.csv' is in the folder.")
raise
# B. Cleaning Education Data and Calculating Gaps
try:
try:
edu_raw = pd.read_csv("education_data.csv", low_memory=False) #For Local File Check the Raw Data Folder in Git Hub Repository.
print(" Loaded local 'education_data.csv'")
except:
print(" Downloading education data (this may take 10s)...")
url = "https://www.datocms-assets.com/41369/1699460825-wide_2023_sept.csv"
s = requests.get(url, timeout=30).content
edu_raw = pd.read_csv(io.StringIO(s.decode('utf-8')), low_memory=False)
# Filtering for USA usig ISO Code or Country Name
if 'iso_code' in edu_raw.columns:
edu_usa = edu_raw[edu_raw['iso_code'] == 'USA'].copy()
else:
edu_usa = edu_raw[edu_raw['country'].isin(['United States', 'USA'])].copy()
edu_usa.columns = edu_usa.columns.str.lower().str.strip()
# Filtering for Wealth (Quintile 1 vs 5 - Poorest vs Richest)
edu_usa['wealth'] = edu_usa['wealth'].astype(str)
edu_usa = edu_usa[edu_usa['wealth'].str.contains('Quintile 1|Quintile 5', case=False)].copy()
# Selecting Specific Indicator Columns
target_cols = ['year', 'wealth', 'comp_prim_v2_m', 'comp_upsec_v2_m', 'comp_higher_4yrs_2529_m']
# Keeping only columns that exist in the file
existing_cols = [c for c in target_cols if c in edu_usa.columns]
edu_usa = edu_usa[existing_cols].copy()
# Renameing to readable names
rename_map = {
'comp_prim_v2_m': 'Primary',
'comp_upsec_v2_m': 'HighSchool',
'comp_higher_4yrs_2529_m': 'College'
}
edu_usa.rename(columns=rename_map, inplace=True)
# Standardizing Wealth Names
edu_usa['wealth'] = edu_usa['wealth'].apply(lambda x: 'Poorest' if '1' in x else 'Richest')
# Pivoting to get Gaps
# Pivoting Year as Index, Wealth as Columns, Indicators as Values
pivot_values = [c for c in ['Primary', 'HighSchool', 'College'] if c in edu_usa.columns]
edu_pivot = edu_usa.pivot_table(index='year', columns='wealth', values=pivot_values).reset_index()
# Flattening Columns (e.g., ('College', 'Richest') -> 'College_Richest')
edu_pivot.columns = [f"{col[0]}_{col[1]}" if col[1] else col[0] for col in edu_pivot.columns]
# Calculating Gaps
for level in ['Primary', 'HighSchool', 'College']:
if f'{level}_Richest' in edu_pivot.columns and f'{level}_Poorest' in edu_pivot.columns:
edu_pivot[f'{level}_Gap'] = edu_pivot[f'{level}_Richest'] - edu_pivot[f'{level}_Poorest']
# Scale to % if needed
if edu_pivot[f'{level}_Gap'].max() <= 1.0:
edu_pivot[f'{level}_Gap'] *= 100
print(f" Education Gaps Calculated: {len(edu_pivot)} years")
except Exception as e:
print(f" Education Load Error: {e}")
raise
#C. Merging
final_df = pd.merge(gini_clean[['Year', 'Gini_Index']], edu_pivot, left_on='Year', right_on='year', how='inner')
print(f" Final Merged Dataset: {len(final_df)} overlapping years")
Gini Data Loaded: 61 years Downloading education data (this may take 10s)...
Education Gaps Calculated: 6 years Final Merged Dataset: 6 overlapping years
Data Visulization and Analysis¶
colors = {
'gini': '#E63946', # Red
'primary': '#2A9D8F', # Green
'highschool': '#E9C46A',# Yellow
'college': '#1f77b4', # Blue (The main story)
'gap': '#264653' # Dark
}
The Context: A More Unequal America¶
Before analyzing education, we must establish the economic reality. The United States has become significantly more unequal over the last four decades.
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=final_df['Year'], y=final_df['Gini_Index'],
mode='lines+markers', name='Gini Index',
line=dict(color=colors['gini'], width=4)))
# Calculating Trend Line
slope, intercept, r_value, p_value, std_err = stats.linregress(final_df['Year'], final_df['Gini_Index'])
line_x = final_df['Year']
line_y = slope * line_x + intercept
# Adding Trend Line
fig1.add_trace(go.Scatter(x=line_x, y=line_y, mode='lines',
name=f'Trend (+{slope:.2f}/year)',
line=dict(color='black', width=2, dash='dash')))
fig1.update_layout(title="<b>The Context:</b> Income Inequality Rising in USA (2013 - 2020)",
xaxis_title="Year", yaxis_title="Gini Index", template="plotly_white", height=500)
fig1.show()
The Reality of the Gap¶
When we look at the most recent data, the disparity becomes visible. While primary education is nearly universal, the gap opens up significantly as students progress to higher levels.
if len(final_df) > 0:
latest = final_df.iloc[-1]
year = int(latest['Year'])
levels = [('Primary', colors['primary']), ('HighSchool', colors['highschool']), ('College', colors['college'])]
fig11 = go.Figure()
for i, (level, colr) in enumerate(levels):
# Check if columns exist
if f'{level}_Richest' in latest.index and f'{level}_Poorest' in latest.index:
rich = latest[f'{level}_Richest']
poor = latest[f'{level}_Poorest']
# Line
fig11.add_trace(go.Scatter(x=[poor, rich], y=[level, level], mode='lines',
line=dict(color='gray', width=3), showlegend=False))
# Poor Dot
fig11.add_trace(go.Scatter(x=[poor], y=[level], mode='markers',
marker=dict(color=colors['gini'], size=15),
name='Poorest 20%', showlegend=(i==0)))
# Rich Dot
fig11.add_trace(go.Scatter(x=[rich], y=[level], mode='markers',
marker=dict(color=colr, size=15),
name='Richest 20%', showlegend=(i==0)))
fig11.update_layout(title=f"<b>The Reality in {year}:</b> Visualizing the Gap",
xaxis_title="Completion Rate (%)", template="plotly_white", height=400)
fig11.show()
available_gaps = [c for c in ['Primary_Gap', 'HighSchool_Gap', 'College_Gap'] if c in final_df.columns]
if available_gaps:
avg_gaps = final_df[available_gaps].mean()
bar_colors = [colors['primary'] if 'Primary' in c else colors['highschool'] if 'HighSchool' in c else colors['college'] for c in avg_gaps.index]
fig10 = go.Figure(go.Bar(x=avg_gaps.index, y=avg_gaps.values, marker_color=bar_colors))
fig10.update_layout(title="<b>The Hierarchy of Barriers:</b> College is the Biggest Hurdle",
yaxis_title="Average Gap (%)", template="plotly_white", height=500)
fig10.show()
The Trend: The "Floor" vs. The "Ceiling"¶
Tracking these gaps over time reveals a crucial insight. The gap in basic education (Primary/High School) is relatively stable and small. The barrier is not at the entry level; it is at the advancement level.
fig2 = go.Figure()
gap_config = [('Primary', 'Primary_Gap', colors['primary']),
('High School', 'HighSchool_Gap', colors['highschool']),
('College', 'College_Gap', colors['college'])]
for label, col_name, colr in gap_config:
if col_name in final_df.columns:
fig2.add_trace(go.Scatter(x=final_df['Year'], y=final_df[col_name],
name=f'{label} Gap', line=dict(color=colr, width=3)))
fig2.update_layout(title="<b>The Problem:</b> Gaps are Persistent (Especially College)",
xaxis_title="Year", yaxis_title="Gap (Rich - Poor %)", template="plotly_white", height=500)
fig2.show()
fig12 = go.Figure()
for col, colr in [('Primary_Gap', colors['primary']), ('HighSchool_Gap', colors['highschool']), ('College_Gap', colors['college'])]:
if col in final_df.columns:
fig12.add_trace(go.Box(y=final_df[col], name=col.replace('_Gap',''), marker_color=colr))
fig12.update_layout(title="<b>Consistency of Inequality:</b> Range of Gaps over 20 Years",
xaxis_title="School" , yaxis_title="Gap in Percentage", template="plotly_white", height=500)
fig12.show()
The Correlation: The "Class Ceiling"¶
This is the core finding of the study. When we overlay the economic trend with the educational trend, we see them move in lockstep. As the country got more unequal, the college gap remained stubbornly wide.
if 'College_Gap' in final_df.columns:
fig3 = make_subplots(specs=[[{"secondary_y": True}]])
fig3.add_trace(go.Scatter(x=final_df['Year'], y=final_df['Gini_Index'], name="Income Inequality",
line=dict(color=colors['gini'], width=4)), secondary_y=False)
fig3.add_trace(go.Scatter(x=final_df['Year'], y=final_df['College_Gap'], name="College Gap",
line=dict(color=colors['college'], width=4, dash='dot')), secondary_y=True)
fig3.update_layout(title="<b>The Class Ceiling:</b> Inequality & College Gap Move Together",
template="plotly_white", height=500, legend=dict(x=1.1, y=0.5))
fig3.show()
To prove this statistically, we looked at the correlation between Inequality and the Gap at each level of education.
def plot_corr(x_col, y_col, title, color):
if x_col not in final_df.columns or y_col not in final_df.columns: return
clean = final_df.dropna(subset=[x_col, y_col])
if len(clean) < 2: return
slope, intercept, r, p, _ = stats.linregress(clean[x_col], clean[y_col])
fig = go.Figure()
fig.add_trace(go.Scatter(x=clean[x_col], y=clean[y_col], mode='markers',
marker=dict(size=12, color=color), name='Years'))
# Trend line
line_x = np.linspace(clean[x_col].min(), clean[x_col].max(), 100)
fig.add_trace(go.Scatter(x=line_x, y=slope*line_x + intercept, mode='lines',
line=dict(color='black', dash='dash'), name=f'R²={r**2:.2f}'))
fig.update_layout(title=f"<b>{title}</b> (R²={r**2:.2f})",
xaxis_title="Gini Index", yaxis_title="Gap (%)", template="plotly_white", height=450)
fig.show()
plot_corr('Gini_Index', 'Primary_Gap', 'Correlation: Inequality vs Primary Gap', colors['primary'])
plot_corr('Gini_Index', 'HighSchool_Gap', 'Correlation: Inequality vs High School Gap', colors['highschool'])
plot_corr('Gini_Index', 'College_Gap', 'Correlation: Inequality vs College Gap', colors['college'])
corr_cols = [c for c in ['Gini_Index', 'Primary_Gap', 'HighSchool_Gap', 'College_Gap'] if c in final_df.columns]
if len(corr_cols) > 1:
corr_mat = final_df[corr_cols].corr()
fig7 = go.Figure(data=go.Heatmap(z=corr_mat.values, x=corr_cols, y=corr_cols,
colorscale='RdBu_r', text=np.round(corr_mat.values, 2),
texttemplate="%{text}", hoverongaps=False))
fig7.update_layout(title="<b>Correlation Matrix</b>", height=500, width=600)
fig7.show()
The Trajectory: Stagnation¶
Finally, we looked at the path the US has traveled. If economic growth were fixing inequality, we would see these lines move down. Instead, we see a drift.
if 'College_Gap' in final_df.columns:
fig8 = px.scatter(final_df, x="Year", y="College_Gap",
size="Gini_Index",
color="Gini_Index", # Color represents inequality level
color_continuous_scale="Plasma", # Better progression from low (purple) to high (yellow)
size_max=15, # Smaller bubbles to reduce overlap
title="<b>The Heat Trail:</b> Gap Stays High as Inequality Heats Up")
# Add a faint line to connect the dots in chronological order
fig8.add_trace(go.Scatter(
x=final_df['Year'],
y=final_df['College_Gap'],
mode='lines',
line=dict(color='gray', width=1, dash='dot'),
showlegend=False,
hoverinfo='skip'
))
# Add borders so bubbles pop against the white background
fig8.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey'), opacity=0.8), selector=dict(mode='markers'))
fig8.update_layout(
template="plotly_white",
height=500,
coloraxis_colorbar=dict(title="Gini Index")
)
fig8.show()
if 'College_Gap' in final_df.columns:
fig9 = go.Figure()
fig9.add_trace(go.Scatter(x=final_df['Gini_Index'], y=final_df['College_Gap'],
mode='lines+markers+text', text=final_df['Year'], textposition='top center',
marker=dict(size=10, color=final_df['Year'], colorscale='Viridis', showscale=True)))
fig9.update_layout(title="<b> The Trajectory:</b> Drifting Right (More Unequal) without Moving Down (Closing Gap)",
xaxis_title="Gini Index", yaxis_title="College Gap", template="plotly_white", height=600)
fig9.show()
print(f" DATA SCOPE:")
print(f" • Analyzed {len(final_df)} years of overlapping economic and education data.")
print(f" • Gini Index moved from {final_df['Gini_Index'].min():.1f} to {final_df['Gini_Index'].max():.1f} (A significant rise in inequality).")
if 'College_Gap' in final_df.columns:
print(f"\n KEY FINDING: THE 'CLASS CEILING':")
print(f" • The 'College Gap' (difference in graduation rates between Rich and Poor)")
print(f" averaged {final_df['College_Gap'].mean():.1f}% over this period.")
print(f" • Despite economic growth, this gap has NOT closed. It has remained stagnant.")
print(f"\n STATISTICAL VERDICT:")
print(f" • Primary Education: No correlation. The 'Floor' is solid; access is universal.")
print(f" • Higher Education: Strong correlation. The 'Ceiling' is locked.")
print(f" • Conclusion: Income inequality in the USA acts as a specific barrier to")
print(f" social mobility at the university level, effectively creating a 'pay-to-play'")
print(f" system for the middle class.")
DATA SCOPE:
• Analyzed 6 years of overlapping economic and education data.
• Gini Index moved from 40.0 to 41.9 (A significant rise in inequality).
KEY FINDING: THE 'CLASS CEILING':
• The 'College Gap' (difference in graduation rates between Rich and Poor)
averaged 49.3% over this period.
• Despite economic growth, this gap has NOT closed. It has remained stagnant.
STATISTICAL VERDICT:
• Primary Education: No correlation. The 'Floor' is solid; access is universal.
• Higher Education: Strong correlation. The 'Ceiling' is locked.
• Conclusion: Income inequality in the USA acts as a specific barrier to
social mobility at the university level, effectively creating a 'pay-to-play'
system for the middle class.
Recommendations¶
Based on this data, policy interventions should not focus on building schools (access is already there), but on bridging the gap at the higher level:
- Targeted Financial Aid: Increase Pell Grants specifically for the bottom income quintile to lower the "price of admission" to the middle class.
- Retention Support: Funding should be directed toward mentorship and emergency assistance to ensure low-income students who enter college actually graduate.
- Non-Degree Pathways: Strengthen vocational training to provide economic mobility that doesn't require a 4-year degree.
Conclusion¶
The data proves that in the United States, Income Inequality acts as a barrier to advancement, not entry. While the "Floor" of K-12 education is solid, the "Ceiling" of higher education remains locked by wealth. Until the tertiary gap is addressed, education cannot truly serve as the "Great Equalizer."
References¶
- World Bank Open Data: Gini Index (SI.POV.GINI).
- UNESCO WIDE Database: World Inequality Database on Education (Completion rates by Wealth Quintile).
- Analysis Tools: Python (Pandas, Plotly, SciPy).